We looked at essentially two kinds of algorithms. One is value iteration, which hinges on the Bellman equation. The Bellman equation is a very simple observation: you can compute the utility of a state as the reward given by that state plus the discounted utility of the best action. Remember that actions here are non-deterministic, so the utility of an action is a sum over all the possible states we can end up in after taking it. Remember our example: for every action there are up to three states you can actually end up in, so the sum in the Bellman equation is, in this particularly simple example, at most a three-way sum.
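Written out in the standard MDP notation (which may differ slightly from the symbols on the lecture slides), the equation is:

```latex
U(s) \;=\; R(s) \;+\; \gamma \,\max_{a \in A(s)} \;\sum_{s'} P(s' \mid s, a)\, U(s')
```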
Okay, so what this Bellman equation gives us is essentially an update rule. We know what the rewards are, so we can compute forward, and the idea of the value iteration algorithm is simply to apply this update repeatedly and hope for the best. The algorithm is essentially the math from the last slide written down as a program: apply the update, and use some termination criterion, an error bound we want to get below, to decide when we are done. A very simple idea, and surprisingly, it works.
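Here is a minimal sketch of that idea in Python. It is not the lecture's own code; the MDP interface (`states`, `actions`, `P`, `R`, `gamma`, the tolerance `eps`) is an assumed, hypothetical one chosen for illustration.

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Iterate the Bellman update until the largest change drops below eps.

    Assumed interface (hypothetical, for illustration):
      states      -- iterable of states
      actions(s)  -- list of actions available in state s ([] if terminal)
      P(s, a)     -- list of (probability, next_state) pairs
      R(s)        -- reward of state s
    """
    U = {s: 0.0 for s in states}              # initial utility estimate
    while True:
        delta = 0.0
        U_new = {}
        for s in states:
            acts = actions(s)
            if acts:
                # Bellman update: best action's expected discounted utility,
                # where each action's value is a sum over its successor states.
                best = max(sum(p * U[s2] for p, s2 in P(s, a)) for a in acts)
            else:
                best = 0.0                    # terminal state: no future reward
            U_new[s] = R(s) + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                       # termination criterion: error below bound
            return U
```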
And there is a good reason why it works; it is always the same reason with these iterations: we reach a fixed point because something contracts. In the background is Banach's fixed point theorem: something gets smaller in each step, and if you iterate a map that contracts, you end up at a unique fixed point. The only trick is to figure out what contracts, and that is something you have looked at.
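What contracts here, in the standard textbook argument, is the Bellman update itself: viewed as an operator \(B\) on utility vectors, it is a \(\gamma\)-contraction in the max norm, so Banach's theorem yields a unique fixed point. (The lecture may phrase this slightly differently.)

```latex
\| B\,U - B\,U' \|_\infty \;\le\; \gamma\, \| U - U' \|_\infty
```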
And then, as you can imagine, there are various variants of this. One of them is policy iteration, which basically says: instead of looking only at the values, we compute a policy from the given values and then iterate on the policies rather than on the values. It is still very similar and still uses the Bellman equation, but in many cases it converges a bit faster.
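A minimal sketch of policy iteration, using the same hypothetical MDP interface as the value-iteration sketch above. Policy evaluation is approximated here by a fixed number of sweeps; an exact linear-system solve would also work.

```python
import random

def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation and greedy policy improvement."""
    # Start from an arbitrary policy and zero utilities.
    pi = {s: (random.choice(actions(s)) if actions(s) else None) for s in states}
    U = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: estimate U under the fixed policy pi
        #    by a few sweeps of the (max-free) Bellman update.
        for _ in range(eval_sweeps):
            U = {s: R(s) + (gamma * sum(p * U[s2] for p, s2 in P(s, pi[s]))
                            if pi[s] is not None else 0.0)
                 for s in states}
        # 2. Policy improvement: make pi greedy with respect to U.
        changed = False
        for s in states:
            if not actions(s):
                continue
            best = max(actions(s),
                       key=lambda a: sum(p * U[s2] for p, s2 in P(s, a)))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:                       # policy is stable: converged
            return pi, U
```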
Recap: Value/Policy Iteration
The main video on this topic is chapter 7, clip 4.